Large Scale Passenger Detection with Smartphone/Bus Implicit Interaction and Multisensory Unsupervised Cause-effect Learning

Servizi, Valentino, Persson, Dan R., Pereira, Francisco C., Villadsen, Hannah, Bækgaard, Per, Rich, Jeppe, Nielsen, Otto A.

arXiv.org Artificial Intelligence

Intelligent Transportation Systems (ITS) underpin the concept of Mobility as a Service (MaaS), which requires universal and seamless user access across multiple public and private transportation systems while allowing proportional revenue sharing among operators. Current user-sensing technologies such as Walk-in/Walk-out (WIWO) and Check-in/Check-out (CICO) have limited scalability for large-scale deployments. These limitations prevent ITS from supporting analysis, optimization, calculation of revenue sharing, and control of MaaS comfort, safety, and efficiency. We focus on the concept of implicit Be-in/Be-out (BIBO) smartphone sensing and classification. To close the gap and enhance smartphones towards MaaS, we developed a proprietary smartphone-sensing platform that simultaneously collects Bluetooth Low Energy (BLE) signals from BLE devices installed on buses and Global Positioning System (GPS) locations of both buses and smartphones. To enable the training of a model based on GPS features against the BLE pseudo-label, we propose the Cause-Effect Multitask Wasserstein Autoencoder (CEMWA). CEMWA combines and extends several frameworks around Wasserstein autoencoders and neural networks. As a dimensionality reduction tool, CEMWA obtains an auto-validated representation of a latent space describing users' smartphones within the transport system. This representation allows BIBO clustering via DBSCAN. We perform an ablation study of CEMWA's alternative architectures and benchmark against the best available supervised methods. We analyze the sensitivity of performance to label quality. Under the naïve assumption of accurate ground truth, XGBoost outperforms CEMWA. Although XGBoost and Random Forest prove to be tolerant to label noise, CEMWA is agnostic to label noise by design and provides the best performance with an 88% F1 score.
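The final step described above, clustering the learned latent space with DBSCAN, can be sketched as follows. This is an illustrative toy, not the paper's pipeline: the 2-D points stand in for CEMWA latent encodings, and the compact O(n²) DBSCAN below is a naive re-implementation for small inputs.

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Naive DBSCAN for illustration (O(n^2) distance matrix, small inputs only)."""
    n = len(points)
    labels = np.full(n, -1)  # -1 = noise / not yet assigned
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_samples:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:  # breadth-first expansion of the cluster
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_samples:  # j is a core point: expand
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

# Synthetic stand-ins for latent encodings: one tight cluster of
# "on board" traces and one of "off board" traces.
rng = np.random.default_rng(0)
on_board = rng.normal(0.0, 0.3, size=(40, 2))
off_board = rng.normal(5.0, 0.3, size=(40, 2))
labels = dbscan(np.vstack([on_board, off_board]), eps=1.0, min_samples=5)
```

The appeal of DBSCAN here is that it needs no preset number of clusters and marks outliers as noise, which fits an unsupervised BIBO setting.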





Pi-DUAL: Using Privileged Information to Distinguish Clean from Noisy Labels

Wang, Ke, Ortiz-Jimenez, Guillermo, Jenatton, Rodolphe, Collier, Mark, Kokiopoulou, Efi, Frossard, Pascal

arXiv.org Artificial Intelligence

Label noise is a pervasive problem in deep learning that often compromises the generalization performance of trained models. Recently, leveraging privileged information (PI) -- information available only during training but not at test time -- has emerged as an effective approach to mitigate this issue. Yet, existing PI-based methods have failed to consistently outperform their no-PI counterparts in terms of preventing overfitting to label noise. To address this deficiency, we introduce Pi-DUAL, an architecture designed to harness PI to distinguish clean from wrong labels. Pi-DUAL decomposes the output logits into a prediction term, based on conventional input features, and a noise-fitting term influenced solely by PI. A gating mechanism steered by PI adaptively shifts focus between these terms, allowing the model to implicitly separate the learning paths of clean and wrong labels. Empirically, Pi-DUAL achieves significant performance improvements on key PI benchmarks (e.g., +6.8% on ImageNet-PI), establishing a new state-of-the-art test set accuracy. Additionally, Pi-DUAL is a potent method for identifying noisy samples post-training, outperforming other strong methods at this task. Overall, Pi-DUAL is a simple, scalable and practical approach for mitigating the effects of label noise in a variety of real-world scenarios with PI.
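The decomposition described above can be sketched numerically. This is a simplified numpy toy, not the paper's trained architecture: all weights are random stand-ins, the heads are linear, and the exact gating form in Pi-DUAL may differ, but it shows how logits can mix a prediction term on features `x` with a noise-fitting term steered by privileged information `pi`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, d_x, d_pi = 3, 4, 2
W_pred = rng.normal(size=(d_x, n_classes))    # prediction head on input features
W_noise = rng.normal(size=(d_pi, n_classes))  # noise-fitting head on PI
w_gate = rng.normal(size=d_pi)                # gate steered by PI

def pi_dual_logits(x, pi):
    """Gated mix of a prediction term and a PI-driven noise-fitting term."""
    gate = 1.0 / (1.0 + np.exp(-pi @ w_gate))  # sigmoid gate in (0, 1)
    return (1.0 - gate) * (x @ W_pred) + gate * (pi @ W_noise)

x, pi = rng.normal(size=d_x), rng.normal(size=d_pi)
logits = pi_dual_logits(x, pi)
```

Since PI is unavailable at test time, only the prediction term survives at inference, which is why fitting the noise through the PI branch does not contaminate test predictions.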


Practical considerations for Machine Learning Classification - AskSid

#artificialintelligence

There is something very satisfying about building a machine learning classifier on a toy dataset. We can achieve high accuracy and feel good while doing it. But this doesn't really help us or prepare us for real-world datasets and the issues they pose. If you have ever trained a machine learning classification model, you may have come across this issue. People use different words for it: 'imbalanced dataset', 'the model is skewed', etc. Let's say we are training a model to detect spam emails.
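The spam example illustrates the pitfall concretely. In the sketch below (hypothetical numbers, not from the post), a degenerate "classifier" that always predicts the majority class still scores 95% accuracy on a skewed dataset while catching zero spam, which is why accuracy alone is misleading for imbalanced data.

```python
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)  # 95 legitimate emails, 5 spam
y_pred = np.zeros_like(y_true)         # always predict "not spam"

accuracy = (y_pred == y_true).mean()              # looks great: 0.95
spam_recall = (y_pred[y_true == 1] == 1).mean()   # useless: 0.0 spam caught
```

Metrics such as recall, precision, or F1 on the minority class expose the failure that accuracy hides.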


Early-Learning Regularization Prevents Memorization of Noisy Labels

Liu, Sheng, Niles-Weed, Jonathan, Razavian, Narges, Fernandez-Granda, Carlos

arXiv.org Machine Learning

We propose a novel framework to perform classification via deep learning in the presence of noisy annotations. When trained on noisy labels, deep neural networks have been observed to first fit the training data with clean labels during an "early learning" phase, before eventually memorizing the examples with false labels. We prove that early learning and memorization are fundamental phenomena in high-dimensional classification tasks, even in simple linear models, and give a theoretical explanation in this setting. Motivated by these findings, we develop a new technique for noisy classification tasks, which exploits the progress of the early learning phase. In contrast with existing approaches, which use the model output during early learning to detect the examples with clean labels, and either ignore or attempt to correct the false labels, we take a different route and instead capitalize on early learning via regularization. There are two key elements to our approach. First, we leverage semi-supervised learning techniques to produce target probabilities based on the model outputs. Second, we design a regularization term that steers the model towards these targets, implicitly preventing memorization of the false labels. The resulting framework is shown to provide robustness to noisy annotations on several standard benchmarks and real-world datasets, where it achieves results comparable to the state of the art.
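The two key elements of the approach, a running target built from model outputs and a regularizer that steers predictions toward it, can be sketched as follows. This is a hedged toy in the spirit of early-learning regularization, not the paper's exact loss: the model output is a fixed stand-in, and `beta` is an illustrative momentum value.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

target = np.full(3, 1 / 3)  # running-average target probabilities
beta = 0.7                  # momentum for the target update (illustrative)

for step in range(5):
    p = softmax(np.array([2.0, 0.5, 0.1]))   # model output (fixed toy stand-in)
    target = beta * target + (1 - beta) * p  # temporal ensembling of outputs
    # Regularization term: minimizing log(1 - <p, target>) pushes p toward
    # the running target, implicitly resisting memorization of false labels.
    reg = np.log(1.0 - np.dot(p, target))
```

During early learning the running target tracks the (mostly correct) predictions, so the regularizer later anchors the model to them instead of letting it drift toward the false labels.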


No Regret Sample Selection with Noisy Labels

Mitsuo, N., Uchida, S., Suehiro, D.

arXiv.org Machine Learning

Deep Neural Networks (DNNs) suffer from noisy labeled data because of the heavy risk of overfitting. To avoid this risk, in this paper we propose a novel sample selection framework for learning from noisy samples. The core idea is to employ a "regret" minimization approach: the proposed sample selection method adaptively selects a subset of the noisy-labeled training samples so as to minimize the regret of selecting noisy samples. The algorithm is efficient and comes with theoretical support. Moreover, unlike typical approaches, the algorithm does not require any side information or learning information that depends on the training settings of the DNN. The experimental results demonstrate that the proposed method improves the performance of a black-box DNN trained on noisy labeled data.
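For context, the generic baseline that sample-selection methods build on can be sketched in a few lines. This is the common small-loss heuristic, not the regret-minimization algorithm of this paper: keep the fraction of training samples with the smallest current loss as the "probably clean" subset, since clean labels tend to be fitted first.

```python
import numpy as np

losses = np.array([0.1, 2.5, 0.2, 3.1, 0.15, 0.3])  # per-sample training losses (toy values)
keep_ratio = 0.5                                    # fraction assumed clean
k = int(len(losses) * keep_ratio)
selected = np.argsort(losses)[:k]                   # indices of the k smallest losses
# → samples 0, 2, 4 are kept; the high-loss samples 1 and 3 are suspected noisy
```

A regret-based selector replaces this fixed threshold with an adaptive, theoretically grounded selection rule, but the input (per-sample losses from a black-box DNN) is the same.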